Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
Authors
Abstract
We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with upper confidence bounds to implement optimism in the face of uncertainty. Apart from the existence of an optimal policy satisfying the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.
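As a rough illustration of the two ingredients named above, the sketch below (plain NumPy; all names and the toy reward signal are illustrative assumptions, not the paper's algorithm) shows state aggregation over [0, 1] together with a Hoeffding-style upper confidence bound on aggregated rewards. The full algorithm would additionally build optimistic transition estimates and plan over the aggregated MDP.

import numpy as np

def aggregate(state, n_bins):
    # Map a continuous state in [0, 1] to one of n_bins intervals (aggregated states).
    return min(int(state * n_bins), n_bins - 1)

def reward_ucb(reward_sums, counts, t, delta=0.05):
    # Optimistic reward estimates per (interval, action): empirical mean plus a
    # Hoeffding-style bonus that shrinks with the visit count; unvisited pairs
    # keep the maximal optimistic value 1.
    n = np.maximum(counts, 1)
    means = reward_sums / n
    bonus = np.sqrt(np.log(2.0 * reward_sums.size * t / delta) / (2.0 * n))
    return np.where(counts > 0, np.minimum(means + bonus, 1.0), 1.0)

# Toy usage: 8 intervals over [0, 1], 2 actions, rewards from a stand-in signal.
n_bins, n_actions = 8, 2
reward_sums = np.zeros((n_bins, n_actions))
counts = np.zeros((n_bins, n_actions))
rng = np.random.default_rng(0)
for t in range(1, 101):
    s = rng.random()                                            # continuous state
    i = aggregate(s, n_bins)                                    # aggregated state
    a = int(np.argmax(reward_ucb(reward_sums, counts, t)[i]))   # optimistic action
    r = rng.random() * (0.5 + 0.5 * s)                          # illustrative reward
    reward_sums[i, a] += r
    counts[i, a] += 1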
Similar References
Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic on...
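For context, a worked formula (the standard definition used in this line of work, stated here as background rather than quoted from the abstract): the undiscounted regret after T steps compares the reward actually collected with what the optimal average reward ρ* would have yielded,

\mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_{t},

where r_t denotes the reward received at step t; UCRL-type analyses bound this quantity logarithmically in T in the gap-dependent case.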
Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning
We consider the problem of undiscounted reinforcement learning in continuous state space. Regret bounds in this setting usually hold under various assumptions on the structure of the reward and transition function. Under the assumption that the rewards and transition probabilities are Lipschitz, for 1-dimensional state space a regret bound of Õ(T^{3/4}) after any T steps has been given by Ortner...
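A one-line heuristic for where the T^{3/4} rate comes from (suppressing the diameter, constants, and logarithmic factors; this is a sketch of the usual balancing argument, not a statement from the cited paper): aggregating the Lipschitz problem into n intervals incurs a tabular estimation term over n states plus a bias term from the aggregation,

\underbrace{\tilde O\!\left(n\sqrt{T}\right)}_{\text{estimation over }n\text{ intervals}} \;+\; \underbrace{O\!\left(T\,L\,n^{-1}\right)}_{\text{Lipschitz aggregation error}} \quad\Longrightarrow\quad n \asymp T^{1/4},\;\; \text{regret } \tilde O\!\left(T^{3/4}\right).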
Near-optimal Regret Bounds for Reinforcement Learning
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinfo...
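In symbols, the diameter described in this abstract can be written as (notation assumed, following the standard definition):

D(M) \;:=\; \max_{s \neq s'} \; \min_{\pi} \; \mathbb{E}\!\left[\, T(s' \mid \pi, s) \,\right],

where T(s' | π, s) is the (random) number of steps policy π needs to reach s' when started in s.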
Near-optimal Regret Bounds for Reinforcement Learning
This technical report is an extended version of [1]. For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ i...
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions and diameter D, when T ≥ SA. Here, reg...
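To make the sample-then-plan pattern behind posterior sampling concrete, here is a minimal episodic sketch for a small finite MDP (plain NumPy; the prior choices, horizon, and Bernoulli rewards are illustrative assumptions, and this is not the optimistic, worst-case variant analysed in the paper):

import numpy as np

rng = np.random.default_rng(1)
S, A, H, EPISODES = 5, 2, 10, 50                     # states, actions, horizon, episodes

# True (unknown) MDP, used only to generate data.
P_true = rng.dirichlet(np.ones(S), size=(S, A))      # P_true[s, a] is a distribution over next states
R_true = rng.random((S, A))                          # mean rewards in [0, 1]

# Posterior statistics: Dirichlet counts for transitions, Beta counts for Bernoulli rewards.
trans_counts = np.ones((S, A, S))                    # Dirichlet(1, ..., 1) prior
rew_alpha = np.ones((S, A))                          # Beta(1, 1) prior
rew_beta = np.ones((S, A))

def plan(P, R, horizon):
    # Finite-horizon value iteration; returns the greedy policy for each step.
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = R + P @ V                                # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for ep in range(EPISODES):
    # Sample one MDP from the posterior, plan against it, act greedily w.r.t. the sample.
    P_sample = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    R_sample = rng.beta(rew_alpha, rew_beta)
    policy = plan(P_sample, R_sample, H)

    s = 0
    for h in range(H):
        a = policy[h, s]
        r = rng.binomial(1, R_true[s, a])            # Bernoulli reward with true mean
        s_next = rng.choice(S, p=P_true[s, a])
        trans_counts[s, a, s_next] += 1              # posterior update
        rew_alpha[s, a] += r
        rew_beta[s, a] += 1 - r
        s = s_next

Each episode draws one MDP from the current posterior, plans against that single sample, and updates the posterior with the observed transitions and rewards.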